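Chit-chat
After eight days of basic practice, today I finally get to write my first web crawler!
I looked through books and the web for a first library to try, and settled on Requests.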
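The Requests library
What is Requests?
Requests is an HTTP library for Python (a third-party library). Its goal is to make HTTP requests simpler and more human-friendly, and it provides methods for GET, POST, and the other HTTP verbs.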
Installation
pip install requests
import requests
or
from requests import get
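get
Requests the specified resource; query parameters can be passed as a params dict.
post
Submits data to the specified resource; the data can be passed as a data dict.
put
Sends the latest content to the specified resource; the data can be passed as a data dict.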
r = requests.put('url', data = {'key':'value'})
r = requests.delete('url')
r = requests.head('url')
r = requests.options('url')
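print(r.text) # response body as a str
print(r.encoding) # encoding used to decode the response (can also be set)
print(r.url) # URL of the requested resource
print(r.status_code) # response status code (int)
print(r.headers) # response headers (dict-like)
print(r.cookies) # cookies from the response (dict-like)
print(r.history) # request history, e.g. redirects (list)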
4. JSON data
If the data you get back is in JSON format, you can use .json() to decode it and return the result (usually a dict).
r = requests.get('url')
r.json()
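To try the response attributes and .json() without depending on a real website, here is a minimal sketch that starts a tiny local server and requests it. The handler class, the fetch_json_demo helper, and the JSON payload are all made up for illustration; only requests.Session, .get(), .status_code, .headers and .json() come from Requests itself.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"name": "requests", "day": 9}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_json_demo():
    """Start the local server, request it once, and return what came back."""
    server = HTTPServer(("127.0.0.1", 0), JsonHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        with requests.Session() as session:
            session.trust_env = False  # ignore proxy settings for this local demo
            r = session.get(url, timeout=5)
        return r.status_code, r.headers["Content-Type"], r.json()
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_json_demo() returns the status code, the Content-Type header, and the decoded JSON as a Python dict.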
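5. Custom headers
Some websites block requests that identify themselves as python-requests, so you sometimes need to set your own headers.
Note that the words in a header name are separated with -, not the _ you may be used to!
Also, non-standard (custom) header fields are conventionally given an x- prefix to mark them as such.
url = 'URL'
headers = {'name':'value'}
r = requests.get(url, headers = headers)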
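To see which headers would actually go out, you can build the request without sending it. This is only a sketch: the example.com URL, the User-Agent string and the X-Demo-Header field are placeholders I made up, while requests.Request(...).prepare() and requests.utils.default_user_agent() are part of Requests.

```python
import requests

url = "https://example.com/"  # placeholder URL, nothing is sent
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; my-first-crawler/0.1)",  # made-up value
    "X-Demo-Header": "hello",  # made-up custom field, hyphen-separated
}

# Build the request locally so the headers can be inspected
prepared = requests.Request("GET", url, headers=headers).prepare()

# This is the User-Agent Requests uses when you do not override it
default_ua = requests.utils.default_user_agent()

print(default_ua)
print(prepared.headers["User-Agent"])
```

The first line printed starts with python-requests/, which is the value some sites filter on; the second is the replacement taken from the headers dict.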
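6. Timeout
Useful for checking whether a site is reachable, and for not getting stuck on a site that is under maintenance or down.
requests.get('url', timeout = [SECOND]) # [SECOND] is the timeout in seconds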
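When the timeout is exceeded, Requests raises an exception instead of returning a response, so the call is usually wrapped in try/except. A minimal sketch, assuming a helper name (probe) and a default of 3 seconds that are my own choices:

```python
import requests


def probe(url, seconds=3.0):
    """Return the status code, or None if the site is too slow or unreachable."""
    try:
        return requests.get(url, timeout=seconds).status_code
    except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
        # Timeout: no answer within `seconds`; ConnectionError: could not connect
        return None
```

For example, probe('URL', seconds=2) gives back 200 for a healthy page and None for one that does not answer in time.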
7. Getting and setting cookies
# get
url = 'URL'
r = requests.get(url)
r.cookies['example_cookie_name']
# set
url = 'URL'
cookies = dict(cookie_are = '')
r = requests.get(url, cookies = cookies)
r.text
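Here is a small offline sketch of both directions. The URL, cookie names and values are placeholders of mine; the point is only that a cookies dict ends up in the Cookie request header, and that a cookie jar can be read like a dict.

```python
import requests

url = "https://example.com/"  # placeholder URL, nothing is sent
cookies = {"cookie_are": "working"}  # made-up cookie value

# Sending: the dict is turned into a Cookie header on the prepared request
prepared = requests.Request("GET", url, cookies=cookies).prepare()
cookie_header = prepared.headers["Cookie"]
print(cookie_header)

# Reading: r.cookies is a RequestsCookieJar, which supports dict-style access
jar = requests.cookies.RequestsCookieJar()
jar.set("example_cookie_name", "abc")  # stand-in for a cookie set by a server
print(jar["example_cookie_name"])
```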
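HTTP status codes
Status code | Meaning |
---|---|
200 | Page OK |
301 | Page has moved permanently; redirects to the new URL |
302 | Temporarily moved to a new location |
400 | Bad request |
401 | Unauthorized; credentials are required |
403 | Forbidden; no permission |
404 | Page not found |
500 | Server error |
502 | A service on the server is not running properly |
503 | Server temporarily unable to handle the request (for example, overloaded by traffic) |
504 | Server did not respond |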
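In code, you rarely compare against every number in the table. A common pattern is raise_for_status(), which raises for any 4xx or 5xx code. A minimal sketch, where check is a helper name I made up and the Response objects are built by hand instead of coming from a real site:

```python
import requests


def check(response):
    """Return 'no error' for a non-error response, else a short description."""
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError:
        kind = "client error" if response.status_code < 500 else "server error"
        return f"{response.status_code} {kind}"
    return "no error"


# Hand-built Response objects stand in for real replies
not_found = requests.Response()
not_found.status_code = requests.codes.not_found  # 404

fine = requests.Response()
fine.status_code = requests.codes.ok  # 200

print(check(not_found))
print(check(fine))
```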
Requests GET
Parameters submitted with GET are sent in the URL (publicly visible).
The first step is to make a GET request.
import requests
url = 'https://www.google.com.tw/?hl=zh_TW'
r = requests.get(url) # send a GET request to url
print(type(url),r) # print the type of url and the response status
#output
<class 'str'> <Response [200]>
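In a GET request, any parameters are placed directly in the URL, after the ?. If there are several parameters, they are separated by &.
For example, https://www.google.com.tw/?hl=zh_TW has only one parameter, which follows the ?. The parameter is hl and its value is zh_TW.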
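Instead of typing the ? and & by hand, you can let Requests build the query string from a params dict. The sketch below only prepares the request and sends nothing; the second parameter q is a hypothetical addition just to show the & separator.

```python
import requests

base = "https://www.google.com.tw/"

# One parameter: appended after "?"
one = requests.Request("GET", base, params={"hl": "zh_TW"}).prepare()

# Two parameters: joined with "&" ("q" is a made-up second parameter)
two = requests.Request("GET", base, params={"hl": "zh_TW", "q": "python"}).prepare()

print(one.url)  # https://www.google.com.tw/?hl=zh_TW
print(two.url)  # https://www.google.com.tw/?hl=zh_TW&q=python
```

The same params dict can be passed straight to requests.get(base, params=...).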
auth
Specifies the username and password: r = requests.get('url', auth = ('user', 'pass'))
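For the curious, a (user, password) tuple is sent as HTTP Basic authentication, which is an Authorization header holding the base64 of user:pass. A quick offline sketch, using a placeholder URL and the same dummy credentials as above:

```python
import requests

url = "https://example.com/"  # placeholder URL, nothing is sent

# A two-item tuple is treated as HTTP Basic auth credentials
prepared = requests.Request("GET", url, auth=("user", "pass")).prepare()
auth_header = prepared.headers["Authorization"]

print(auth_header)
```

Base64 is an encoding, not encryption, so this is only safe over HTTPS.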
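Requests POST
Parameters submitted with POST are sent in the request body (not visible in the URL).
POST is mostly used for web forms where the user fills in data.
As before, the first step is to make the request.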
import requests
mydata = {'key':'value'}
r = requests.post('url', data = mydata) # add the data to the POST request
import requests
myfile = {'myfile':open('myfile.docx','rb')} # the file to upload
r = requests.post('url', files = myfile) # add the file to the POST request
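To check that POST data really travels in the body and not in the URL, you can prepare the request and look at it before sending. The example.com address is a placeholder; the rest uses the same mydata dict as above.

```python
import requests

mydata = {"key": "value"}

# Build the POST request locally; nothing is sent
prepared = requests.Request("POST", "https://example.com/form", data=mydata).prepare()

print(prepared.url)                      # the URL carries no parameters
print(prepared.body)                     # the form data lives in the body
print(prepared.headers["Content-Type"])  # how the body is encoded
```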
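My first crawl
I practiced on the Chinese version of the Google homepage. A crawler like this only prints the response string, though: it cannot clean the data or locate elements yet, and the printed string is hard to read.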
import requests
url = 'https://www.google.com.tw/?hl=zh_TW'
r = requests.get(url)
print(r.text)
#output
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="zh-TW"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google={kEI:'ac0rY4vZKNHu-AbgtKPoCw',kEXPI:'0,1302536,56873,6059,206,4804,2316,383,246,5,5367,1123753,1197748,380743,16114,17444,1954,9286,22431,1361,283,12033,17583,4998,13228,3847,10622,22741,5081,885,708,1279,2742,149,1103,840,1983,213,4101,3514,606,2023,1777,520,14670,3227,2845,7,5599,28171,1851,2614,13142,3,346,230,6460,148,13975,4,1528,2304,27348,7422,7356,13659,4437,16786,5815,2542,4097,4049,3,3541,1,42160,2,14016,6249,7867,11623,6700,951,1429,14023,14719,4568,6258,23418,1252,5835,14968,4332,20,7464,445,2,2,1,6960,19672,8155,6582,799,14680,1289,873,14802,1,4831,7,1922,9779,19130,12192,4832,1520,6414,5091,3007,984,122,700,4,1,2,2,2,2,5952,2450,6721,238,2085,3065,5930,2348,14,82,949,1759,1182,751,446,1624,5356,1493,1030,2412,922,666,198,4,305,763,220,36,563,987,411,1541,1867,165,182,143,3,3,2,2,400,1073,563,555,1,1520,220,547,15,1645,675,1226,7,1,61,343,547,352,5,39,424,126,384,164,471,7,217,268,313,323,399,286,3,1,597,120,308,57,180,590,35,4,118,104,1548,538,5343604,656,130,5995858,2803379,3311,141,795,19736,1,298,48,1570,83,1,3,3,1,1,1,20728802,33,3219986,4042143,1964,3094,2321,11258,3405,5543',kBL:'x9VE'};google.sn='webhp';google.kHL='zh-TW';})();(function(){
var f=this||self;var h,k=[];function l(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||h}function m(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b}
function n(a,b,c,d,g){var e="";c||-1!==b.search("&ei=")||(e="&ei="+l(d),-1===b.search("&lei=")&&(d=m(d))&&(e+="&lei="+d));d="";!c&&f._cshid&&-1===b.search("&cshid=")&&"slh"!==a&&(d="&cshid="+f._cshid);c=c||"/"+(g||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+e+"&zx="+Date.now()+d;/^http:/i.test(c)&&"https:"===window.location.protocol&&(google.ml&&google.ml(Error("a"),!1,{src:c,glmm:1}),c="");return c};h=google.kEI;google.getEI=l;google.getLEI=m;google.ml=function(){return null};google.log=function(a,b,c,d,g){if(c=n(a,b,c,d,g)){a=new Image;var e=k.length;k[e]=a;a.onerror=a.onload=a.onabort=function(){delete k[e]};a.src=c}};google.logUrl=n;}).call(this);(function(){
google.y={};google.sy=[];google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.sx=function(a){google.sy.push(a)};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};google.bx=!1;google.lx=function(){};}).call(this);google.f={};(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"===c||"q"===c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addEventListener("click",function(b){var a;a:{for(a=b.target;a&&a!==document.documentElement;a=a.parentElement)if("A"===a.tagName){a="1"===a.getAttribute("data-nohref");break a}a=!1}a&&b.preventDefault()},!0);}).call(this);</script><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
</style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#1558d6}em{color:#c5221f;font-style:normal;font-weight:normal}a em{text-decoration:underline}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-left:4px}input{font-family:inherit}body{background:#fff;color:#000}a{color:#4b11a8;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#1558d6}a:visited{color:#4b11a8}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-left:13px;font-size:11px}.lsbb{background:#f8f9fa;border:solid 1px;border-color:#dadce0 #70757a #70757a #dadce0;height:30px}.lsbb{display:block}#WqQANb a{display:inline-block;margin:0 12px}.lsb{background:url(/images/nav_logo229.png) 0 -261px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#dadce0}.lst:focus{outline:none}</style><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google.erd={jsr:1,bv:1657,de:true};
var h=this||self;var k,l=null!=(k=h.mei)?k:1,n,p=null!=(n=h.sdo)?n:!0,q=0,r,t=google.erd,v=t.jsr;google.ml=function(a,b,d,m,e){e=void 0===e?2:e;b&&(r=a&&a.message);if(google.dl)return google.dl(a,e,d),null;if(0>v){window.console&&console.error(a,d);if(-2===v)throw a;b=!1}else b=!a||!a.message||"Error loading script"===a.message||q>=l&&!m?!1:!0;if(!b)return null;q++;d=d||{};b=encodeURIComponent;var c="/gen_204?atyp=i&ei="+b(google.kEI);google.kEXPI&&(c+="&jexpid="+b(google.kEXPI));c+="&srcpg="+b(google.sn)+"&jsr="+b(t.jsr)+"&bver="+b(t.bv);var f=a.lineNumber;void 0!==f&&(c+="&line="+f);var g=
a.fileName;g&&(0<g.indexOf("-extension:/")&&(e=3),c+="&script="+b(g),f&&g===window.location.href&&(f=document.documentElement.outerHTML.split("\n")[f],c+="&cad="+b(f?f.substring(0,300):"No script found.")));c+="&jsel="+e;for(var u in d)c+="&",c+=b(u),c+="=",c+=b(d[u]);c=c+"&emsg="+b(a.name+": "+a.message);c=c+"&jsst="+b(a.stack||"N/A");12288<=c.length&&(c=c.substr(0,12288));a=c;m||google.log(0,"",a);return a};window.onerror=function(a,b,d,m,e){r!==a&&(a=e instanceof Error?e:Error(a),void 0===d||"lineNumber"in a||(a.lineNumber=d),void 0===b||"fileName"in a||(a.fileName=b),google.ml(a,!1,void 0,!1,"SyntaxError"===a.name||"SyntaxError"===a.message.substring(0,11)||0<a.message.indexOf("Script error")?2:0));r=null;p&&q>=l&&(window.onerror=null)};})();</script></head><body bgcolor="#fff"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var src='/images/nav_logo229.png';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}
}
})();</script><div id="mngb"><div id=gbar><nobr><b class=gb1>搜尋</b> <a class=gb1 href="https://www.google.com.tw/imghp?hl=zh-TW&tab=wi">圖片</a> <a class=gb1 href="https://maps.google.com.tw/maps?hl=zh-TW&tab=wl">地圖</a> <a class=gb1 href="https://play.google.com/?hl=zh-TW&tab=w8">Play</a> <a class=gb1 href="https://www.youtube.com/?tab=w1">YouTube</a> <a class=gb1 href="https://news.google.com/?tab=wn">新聞</a> <a class=gb1 href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class=gb1 href="https://drive.google.com/?tab=wo">雲端硬碟</a> <a class=gb1 style="text-decoration:none" href="https://www.google.com.tw/intl/zh-TW/about/products?tab=wh"><u>更多</u> »</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a href="http://www.google.com.tw/history/optout?hl=zh-TW" class=gb4>網頁記錄</a> | <a href="/preferences?hl=zh-TW" class=gb4>設定</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=zh-TW&passive=true&continue=https://www.google.com.tw/%3Fhl%3Dzh_TW&ec=GAZAAQ" class=gb4>登入</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div></div><center><br clear="all" id="lgpd"><div id="lga"><img alt="Google" height="92" src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png" style="padding:28px 0 14px" width="272" id="hplogo"><br><br></div><form action="/search" name="f"><table cellpadding="0" cellspacing="0"><tr valign="top"><td width="25%"> </td><td align="center" nowrap=""><input name="ie" value="Big5" type="hidden"><input value="zh-TW" name="hl" type="hidden"><input name="source" type="hidden" value="hp"><input name="biw" type="hidden"><input name="bih" type="hidden"><div class="ds" style="height:32px;margin:4px 0"><input class="lst" style="margin:0;padding:5px 8px 0 6px;vertical-align:top;color:#000" autocomplete="off" value="" title="Google 搜尋" maxlength="2048" name="q" size="57"></div><br style="line-height:0"><span 
class="ds"><span class="lsbb"><input class="lsb" value="Google 搜尋" name="btnG" type="submit"></span></span><span class="ds"><span class="lsbb"><input class="lsb" id="tsuid_1" value="好手氣" name="btnI" type="submit"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var id='tsuid_1';document.getElementById(id).onclick = function(){if (this.form.q.value){this.checked = 1;if (this.form.iflsig)this.form.iflsig.disabled = false;}
else top.location='/doodles/';};})();</script><input value="AJiK0e8AAAAAYyvbecC6mu0GqbcdsPqqhXDesXTsFjCa" name="iflsig" type="hidden"></span></span></td><td class="fl sblc" align="left" nowrap="" width="25%"><a href="/advanced_search?hl=zh-TW&authuser=0">進階搜尋</a></td></tr></table><input id="gbv" name="gbv" type="hidden"
value="1"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){
var a,b="1";if(document&&document.getElementById)if("undefined"!=typeof XMLHttpRequest)b="2";else if("undefined"!=typeof ActiveXObject){var c,d,e=["MSXML2.XMLHTTP.6.0","MSXML2.XMLHTTP.3.0","MSXML2.XMLHTTP","Microsoft.XMLHTTP"];for(c=0;d=e[c++];)try{new ActiveXObject(d),b="2"}catch(h){}}a=b;if("2"==a&&-1==location.search.indexOf("&gbv=2")){var f=google.gbvu,g=document.getElementById("gbv");g&&(g.value=a);f&&window.setTimeout(function(){location.href=f},0)};}).call(this);</script></form><div
id="gac_scont"></div><div style="font-size:83%;min-height:3.5em"><br></div><span id="footer"><div style="font-size:10pt"><div style="margin:19px auto;text-align:center" id="WqQANb"><a href="http://www.google.com.tw/intl/zh-TW/services/">商業解決方案</a><a href="/intl/zh-TW/about.html">關於 Google</a><a href="https://www.google.com.tw/setprefdomain?prefdom=US&sig=K_Ek7YMfFhDYrLrvRxStxw0qT7zNg%3D" id="fehl">Google.com</a></div></div><p style="font-size:8pt;color:#70757a">© 2022 - <a
href="/intl/zh-TW/policies/privacy/">隱私權</a> - <a href="/intl/zh-TW/policies/terms/">服務條款</a></p></span></center><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google.cdo={height:757,width:1440};(function(){
var a=window.innerWidth,b=window.innerHeight;if(!a||!b){var c=window.document,d="CSS1Compat"==c.compatMode?c.documentElement:c.body;a=d.clientWidth;b=d.clientHeight}a&&b&&(a!=google.cdo.width||b!=google.cdo.height)&&google.log("","","/client_204?&atyp=i&biw="+a+"&bih="+b+"&ei="+google.kEI);}).call(this);})();</script> <script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){google.xjs={ck:'xjs.hp.nAt8mkHlvVw.L.X.O',cs:'ACT90oFbSni2diyqWHkZE0NBPwvwM9CUVw',excm:[]};})();</script> <script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var u='/xjs/_/js/k\x3dxjs.hp.en.Pa2FzRQfyWU.O/am\x3dAACeAAAkAEAB/d\x3d1/ed\x3d1/rs\x3dACT90oGGOHAG99qraA9hYMsxdyq3TX3bvQ/m\x3dsb_he,d';
var d=this||self,e=function(a){return a};
var g;var l=function(a,b){this.g=b===h?a:""};l.prototype.toString=function(){return this.g+""};var h={};function n(){var a=u;google.lx=function(){p(a);google.lx=function(){}};google.bx||google.lx()}
function p(a){google.timers&&google.timers.load&&google.tick&&google.tick("load","xjsls");var b=document;var c="SCRIPT";"application/xhtml+xml"===b.contentType&&(c=c.toLowerCase());c=b.createElement(c);if(void 0===g){b=null;var k=d.trustedTypes;if(k&&k.createPolicy){try{b=k.createPolicy("goog#html",{createHTML:e,createScript:e,createScriptURL:e})}catch(q){d.console&&d.console.error(q.message)}g=b}else g=b}a=(b=g)?b.createScriptURL(a):a;a=new l(a,h);c.src=a instanceof l&&a.constructor===l?a.g:"type_error:TrustedResourceUrl";var f,m;(f=(a=null==(m=(f=(c.ownerDocument&&c.ownerDocument.defaultView||window).document).querySelector)?void 0:m.call(f,"script[nonce]"))?a.nonce||a.getAttribute("nonce")||"":"")&&c.setAttribute("nonce",f);document.body.appendChild(c);google.psa=!0};google.xjsu=u;setTimeout(function(){n()},0);})();function _DumpException(e){throw e;}
function _F_installCss(c){}
(function(){google.jl={blt:'none',chnk:0,dw:false,dwu:true,emtn:0,end:0,ine:false,injs:'none',injt:0,injth:0,injv2:false,lls:'default',pdt:0,rep:0,snet:true,strt:0,ubm:false,uwp:true};})();(function(){var pmc='{\x22d\x22:{},\x22sb_he\x22:{\x22agen\x22:true,\x22cgen\x22:true,\x22client\x22:\x22heirloom-hp\x22,\x22dh\x22:true,\x22dhqt\x22:true,\x22ds\x22:\x22\x22,\x22ffql\x22:\x22zh-TW\x22,\x22fl\x22:true,\x22host\x22:\x22google.com.tw\x22,\x22isbh\x22:28,\x22jsonp\x22:true,\x22msgs\x22:{\x22cibl\x22:\x22清除搜尋\x22,\x22dym\x22:\x22你是不是要查:\x22,\x22lcky\x22:\x22好手氣\x22,\x22lml\x22:\x22瞭解詳情\x22,\x22oskt\x22:\x22輸入工具\x22,\x22psrc\x22:\x22已從您的「\\u003Ca href\x3d\\\x22/history\\\x22\\u003E網頁記錄\\u003C/a\\u003E」中移除這筆搜尋記錄\x22,\x22psrl\x22:\x22移除\x22,\x22sbit\x22:\x22以圖搜尋\x22,\x22srch\x22:\x22Google 搜尋\x22},\x22ovr\x22:{},\x22pq\x22:\x22\x22,\x22refpd\x22:true,\x22rfs\x22:[],\x22sbas\x22:\x220 3px 8px 0 rgba(0,0,0,0.2),0 0 0 1px rgba(0,0,0,0.08)\x22,\x22sbpl\x22:16,\x22sbpr\x22:16,\x22scd\x22:10,\x22stok\x22:\x22ZjtxvoHJ9WqhldticDCWCaRXhOg\x22,\x22uhde\x22:false}}';google.pmc=JSON.parse(pmc);})();</script> </body></html>
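Conclusion
Today I introduced and used the requests package, and went through both the preparation and the hands-on part of my first crawl.
Doing something for the first time really does feel new and fun!
Next, I will keep moving on to other parts of web crawling, such as data cleaning.
Tomorrow!
[Day 10] First data cleaning: Requests HTML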
References
Requests library: https://steam.oxxostudio.tw/category/python/spider/requests.html
Day 9: a simple package introduction, Python Requests
HTTP Header custom fields: https://medium.com/@BeemoLin/http-header-%E8%87%AA%E5%AE%9A%E6%AC%84%E4%BD%8D-a53b8fd9d6f2
Tutorial on using the Python requests module to make HTTP requests and download web data: https://blog.gtwang.org/programming/python-requests-module-tutorial/